where RBConv denotes the convolution operation implemented as a new module, $F_{in}^l$ and $F_{out}^l$ are the feature maps before and after the convolution, respectively, $W^l$ are the full-precision filters, the values of $\hat{W}^l$ are $1$ or $-1$, and $\odot$ is the element-wise product.
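For concreteness, the following sketch shows how such a layer might look in PyTorch. The framework, the module interface, and the choice of one rectification entry of $C^l$ per filter weight are illustrative assumptions rather than the reference implementation; training would additionally route gradients through the approximation of Eq. (3.68) discussed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RBConv(nn.Module):
    """Illustrative sketch of a rectified binary convolution layer."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        shape = (out_channels, in_channels, kernel_size, kernel_size)
        self.W = nn.Parameter(0.1 * torch.randn(shape))  # full-precision filters W^l
        self.C = nn.Parameter(torch.ones(shape))         # learnable rectification matrix C^l
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # Binarize the full-precision filters: the values of W_hat are +1 or -1.
        W_hat = torch.where(self.W >= 0,
                            torch.ones_like(self.W), -torch.ones_like(self.W))
        # Rectify the binary filters with C^l (element-wise product) and convolve.
        return F.conv2d(x, self.C * W_hat, stride=self.stride, padding=self.padding)
```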
During the backward propagation process of RBCNs, the full-precision filters $W$ and the learnable matrices $C$ must be learned and updated. These two sets of parameters are learned jointly: in each convolutional layer, we update $W$ first and then $C$.
Update W: Let $\delta W_i^l$ be the gradient of the full-precision filter $W_i^l$. During backpropagation, the gradients are first passed to $\hat{W}_i^l$ and then to $W_i^l$. Thus,
$$
\delta W_i^l = \frac{\partial L}{\partial W_i^l} = \frac{\partial L}{\partial \hat{W}_i^l}\,\frac{\partial \hat{W}_i^l}{\partial W_i^l}, \tag{3.67}
$$
where
$$
\frac{\partial \hat{W}_i^l}{\partial W_i^l} =
\begin{cases}
2 + 2W_i^l, & -1 \le W_i^l < 0,\\
2 - 2W_i^l, & 0 \le W_i^l < 1,\\
0, & \text{otherwise},
\end{cases} \tag{3.68}
$$
which is an approximation of $2\times$ the Dirac delta function [159].
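In an autograd framework, this piecewise gradient can be attached to the binarization through a custom function. A minimal PyTorch sketch, assuming the piecewise form of Eq. (3.68); the class name and the framework are illustrative choices:

```python
import torch


class ApproxSign(torch.autograd.Function):
    """Binarization whose backward pass uses the piecewise gradient of Eq. (3.68)."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Forward: hard binarization to {-1, +1}.
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # dW_hat/dW: 2 + 2W on [-1, 0), 2 - 2W on [0, 1), and 0 elsewhere.
        dwhat_dw = torch.zeros_like(w)
        dwhat_dw = torch.where((w >= -1) & (w < 0), 2 + 2 * w, dwhat_dw)
        dwhat_dw = torch.where((w >= 0) & (w < 1), 2 - 2 * w, dwhat_dw)
        return grad_output * dwhat_dw
```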
Furthermore,
$$
\frac{\partial L}{\partial \hat{W}_i^l} = \frac{\partial L_S}{\partial \hat{W}_i^l} + \frac{\partial L_{Kernel}}{\partial \hat{W}_i^l} + \frac{\partial L_{Adv}}{\partial \hat{W}_i^l}, \tag{3.69}
$$
and
$$
W_i^l \leftarrow W_i^l - \eta_1 \delta W_i^l, \tag{3.70}
$$
where $\eta_1$ is the learning rate. Then,
$$
\frac{\partial L_{Kernel}}{\partial \hat{W}_i^l} = -\lambda_1 \bigl(W_i^l - C^l \hat{W}_i^l\bigr)\, C^l, \tag{3.71}
$$
$$
\frac{\partial L_{Adv}}{\partial \hat{W}_i^l} = -2\bigl(1 - D(T_i^l; Y)\bigr)\frac{\partial D}{\partial \hat{W}_i^l}. \tag{3.72}
$$
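Put together, the W step amounts to forming the loss terms on the binarized weights, letting autograd accumulate the components of Eq. (3.69) through the approximation of Eq. (3.68), and applying the plain gradient step of Eq. (3.70). A minimal PyTorch sketch, reusing the ApproxSign function above and keeping only the kernel-loss term for brevity; the filter sizes, the scalar $C^l$, and the hyperparameter values are hypothetical, and $L_S$ and $L_{Adv}$ would contribute their gradients in exactly the same way:

```python
import torch

eta1, lambda1 = 0.01, 1e-4    # learning rate and kernel-loss weight (illustrative values)

# Toy full-precision filters W_i^l and a scalar C^l (hypothetical sizes).
W = torch.randn(16, 3, 3, 3, requires_grad=True)
C = torch.tensor(0.8)

# Binarize with the gradient approximation of Eq. (3.68)
# (ApproxSign is the custom autograd function sketched after Eq. (3.68)).
W_hat = ApproxSign.apply(W)

# Kernel loss whose gradient w.r.t. W_hat reproduces Eq. (3.71):
# L_Kernel = (lambda1 / 2) * || W - C * W_hat ||^2.
# W is detached inside the loss so that its gradient reaches W only through
# W_hat, i.e. via the chain rule of Eq. (3.67).
L_kernel = 0.5 * lambda1 * ((W.detach() - C * W_hat) ** 2).sum()

# L_S and the adversarial term implied by Eq. (3.72) would be added here (Eq. 3.69).
L_kernel.backward()

# Eq. (3.70): update the full-precision filters.
with torch.no_grad():
    W -= eta1 * W.grad
    W.grad.zero_()
```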
Update C: We further update the learnable matrix $C^l$ with $W^l$ fixed. Let $\delta C^l$ be the gradient of $C^l$. Then we have
$$
\delta C^l = \frac{\partial L_S}{\partial C^l} + \frac{\partial L_{Kernel}}{\partial C^l} + \frac{\partial L_{Adv}}{\partial C^l}, \tag{3.73}
$$
and
$$
C^l \leftarrow C^l - \eta_2 \delta C^l, \tag{3.74}
$$
where $\eta_2$ is another learning rate. Furthermore,
$$
\frac{\partial L_{Kernel}}{\partial C^l} = -\lambda_1 \sum_i \bigl(W_i^l - C^l \hat{W}_i^l\bigr)\, \hat{W}_i^l, \tag{3.75}
$$
$$
\frac{\partial L_{Adv}}{\partial C^l} = -\sum_i 2\bigl(1 - D(T_i^l; Y)\bigr)\frac{\partial D}{\partial C^l}. \tag{3.76}
$$
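The C step can be written in the same autograd style or directly with the closed-form gradients. A minimal sketch with a scalar $C^l$, keeping only the kernel-loss term of Eq. (3.75); the sizes and values are hypothetical, and the $L_S$ and $L_{Adv}$ contributions of Eqs. (3.73) and (3.76) would be accumulated into $\delta C^l$ in the same fashion:

```python
import torch

eta2, lambda1 = 0.01, 1e-4    # second learning rate and kernel-loss weight (illustrative values)

W = torch.randn(16, 3, 3, 3)  # full-precision filters, kept fixed in this step
W_hat = torch.where(W >= 0, torch.ones_like(W), -torch.ones_like(W))
C = torch.tensor(0.8)

# Eq. (3.75): dL_Kernel/dC = -lambda1 * sum_i (W_i - C * W_hat_i) * W_hat_i.
grad_C = -lambda1 * ((W - C * W_hat) * W_hat).sum()

# Contributions from L_S and L_Adv (Eqs. 3.73 and 3.76) would be added to grad_C here.
delta_C = grad_C

# Eq. (3.74): update C with the second learning rate eta2.
C = C - eta2 * delta_C
```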
These derivations show that the rectified process is trainable in an end-to-end manner. The complete training process is summarized in Algorithm 13, including how the discriminators are updated. As described in line 17 of Algorithm 13, we update the other parameters independently while keeping the parameters of the convolutional layers fixed, which increases the variety of each layer's feature maps. In this way, we speed up training convergence and more fully explore the potential of 1-bit networks. In our implementation, all the values of $C^l$ are replaced by their average during the forward process, so only a scalar, rather than a matrix, is involved at inference, which speeds up computation.
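This averaging can be implemented by collapsing $C^l$ to its mean before the convolution, so that inference only multiplies the binary filters by one scalar. A small sketch under the same PyTorch assumption; the helper name and tensor sizes are hypothetical:

```python
import torch
import torch.nn.functional as F


def rbconv_inference(x, W, C, stride=1, padding=1):
    """Forward pass with the values of C^l replaced by their average (hypothetical helper)."""
    W_hat = torch.where(W >= 0, torch.ones_like(W), -torch.ones_like(W))
    c_scalar = C.mean()                        # a scalar, not a matrix, at inference
    return F.conv2d(x, c_scalar * W_hat, stride=stride, padding=padding)


# Usage on a toy input: a 16-filter 3x3 layer keeps the 32x32 spatial size.
x = torch.randn(1, 3, 32, 32)
W = torch.randn(16, 3, 3, 3)
C = torch.rand(16, 3, 3, 3)                    # learnable matrix C^l from training
print(rbconv_inference(x, W, C).shape)         # torch.Size([1, 16, 32, 32])
```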